Journal of the American Medical Informatics Association
Top medRxiv preprints most likely to be published in this journal, ranked by match strength.
Background: Clinical trial statistical programming is transitioning from manual, study-specific coding toward metadata-driven, automated pipelines. Although general data management transformation has been reviewed, comprehensive synthesis of statistical programming automation--particularly tables, listings, and figures (TLF) generation and validation frameworks--remains limited. This review addresses this gap through systematic evidence synthesis. Methods: We conducted a structured literature revie...
Purpose: Large language models (LLMs) are used for biomedical text processing, but individual decisions are often hard to audit. We evaluated whether enforcing a mechanically checkable "show your work" quote affects accuracy, stability, and verifiability for trial eligibility-scope classification from abstracts. Methods: We used 200 oncology randomized controlled trials (2005-2023) and provided models with only the title and abstract. Trials were labeled with whether they allowed for the inclusio...
Accessing complex clinical registries traditionally requires SQL programming expertise, limiting data accessibility for non-technical researchers. In this paper, we designed a text-to-SQL solution based on large language models (LLMs) and evaluated whether it could enable natural language querying of a real-world clinical registry under strict privacy and security constraints. Using self-hosted, open-source LLMs, we developed a multi-layered optimization framework incorporating metadata enrichment,...
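The metadata-enrichment step this abstract describes can be sketched as follows. The table, column names, descriptions, and prompt wording below are illustrative assumptions for demonstration, not the authors' actual framework:

```python
# Minimal sketch of metadata-enriched text-to-SQL prompting: each column
# is paired with a human-readable description so an LLM can map clinical
# terms in the question to registry fields. Schema contents are invented.

def build_sql_prompt(question: str, schema: dict[str, dict[str, str]]) -> str:
    """Assemble a schema-aware prompt for a text-to-SQL model."""
    lines = ["You are a SQL assistant for a clinical registry.",
             "Schema (column: description):"]
    for table, columns in schema.items():
        lines.append(f"TABLE {table}")
        for col, desc in columns.items():
            lines.append(f"  {col}: {desc}")
    lines.append(f"Question: {question}")
    lines.append("Return a single read-only SELECT statement.")
    return "\n".join(lines)

schema = {
    "patients": {"pat_id": "unique patient identifier",
                 "dx_code": "ICD-10-CM diagnosis code",
                 "admit_dt": "admission date (YYYY-MM-DD)"},
}
prompt = build_sql_prompt("How many patients were admitted in 2023?", schema)
```

Restricting the instruction to read-only SELECT statements is one simple way to respect the privacy and security constraints the abstract mentions.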
Large language models (LLMs) are increasingly transforming scientific workflows, yet their application to rigorous evidence synthesis remains underexplored. We present a fully automated pipeline, executed as a single Python script, that leverages the Claude API to generate systematic reviews from literature search through manuscript completion without human intervention. Our pipeline processes hundreds of papers through iterative API calls for inclusion evaluation, information extractio...
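The per-paper inclusion-evaluation loop described here can be sketched with a stubbed model call. `ask_model` below is a placeholder standing in for a Claude API request; its decision rule, the field names, and the papers are invented for the demo:

```python
# Sketch of an iterative inclusion-screening loop: one model call per
# paper, keeping only papers judged "include". The stub's rule (include
# anything mentioning "randomized") is an assumption for illustration.

def ask_model(title: str, abstract: str) -> str:
    # Stub for an LLM API call returning "include" or "exclude".
    return "include" if "randomized" in abstract.lower() else "exclude"

def screen(papers: list[dict]) -> list[dict]:
    """One screening pass over all candidate papers."""
    return [p for p in papers
            if ask_model(p["title"], p["abstract"]) == "include"]

papers = [
    {"title": "Trial A", "abstract": "A randomized controlled trial of X."},
    {"title": "Review B", "abstract": "A narrative overview of Y."},
]
included = screen(papers)
```

In a real pipeline the stub would be replaced by an API call carrying the review's inclusion criteria in the prompt, with retries and logging around it.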
Background: Interprofessional teams are central to high-quality patient care. However, identifying the clinician primarily responsible for a patient requires labor-intensive methodologies. Although electronic health record (EHR) audit logs offer a scalable alternative, their use for identifying frontline clinicians is underdeveloped. Objective: To develop and validate an algorithm utilizing EHR audit logs to identify the primary frontline clinician per patient day of an encounter and to describe care...
Objective: To comprehensively evaluate the validity of ICD-10-CM codes for both prevalent diagnoses and less common diseases, and to assess the performance of a large language model (LLM)-based system in validating these codes. Materials and Methods: This retrospective study analyzed hospital admissions from the Medical Information Mart for Intensive Care (MIMIC-IV) database. We developed and validated an LLM-based system using GPT-4o, refined through iterative prompt engineering, to assess ICD-10-CM co...
Background: Systematic reviews (SRs) are essential for evidence-based medicine but require extensive time and resources for abstract screening. Large language models (LLMs) offer potential for automating this process, yet concerns about data privacy, intellectual property protection, and reproducibility limit the use of cloud-based solutions in research settings. Objective: To evaluate the performance of a locally deployed 20-billion-parameter LLM for automated abstract screening in systematic revi...
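Evaluations of automated screening are typically reported as sensitivity (fraction of truly eligible abstracts the model keeps) and specificity (fraction of ineligible abstracts it excludes) against human labels. A minimal worked example, with made-up demo counts rather than this study's results:

```python
# Sensitivity and specificity of model screening decisions against
# human reference labels. True = "include". Data below are invented.

def sens_spec(pred: list[bool], truth: list[bool]) -> tuple[float, float]:
    tp = sum(p and t for p, t in zip(pred, truth))            # kept, eligible
    tn = sum((not p) and (not t) for p, t in zip(pred, truth))  # excluded, ineligible
    fn = sum((not p) and t for p, t in zip(pred, truth))      # missed eligible
    fp = sum(p and (not t) for p, t in zip(pred, truth))      # kept ineligible
    return tp / (tp + fn), tn / (tn + fp)

truth = [True, True, False, False, False]   # human labels
pred  = [True, False, False, False, True]   # model decisions
sensitivity, specificity = sens_spec(pred, truth)  # 0.5, ~0.667
```

For screening, a false negative (an eligible trial excluded) is usually costlier than a false positive, so high sensitivity is the primary target.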
Objective: The use of ambient AI documentation tools is rapidly growing in US hospitals and clinics. Such tools generate the first draft of clinical notes from scribed patient-provider conversations, which clinicians can then review and edit before signing into electronic health records (EHR). Understanding how and why clinicians make modifications to AI-generated drafts is critical to improving AI design and clinical efficiency, yet it has been under-studied. Th...
We describe a new custom feature within our Epic Systems electronic health record (EHR) that automates stratified randomization at the point of care or order. As a demonstration use case, we conducted a randomized trial of a provider-facing alert for short-interval HbA1c orders. Over 3 months, the alert dramatically reduced repeat orders. This transportable clinical informatics application transforms health systems' ability to conduct pragmatic clinical trials and deliver clinical care within the ...
Objective: To analyze the impact of telemedicine on emergency department (ED) utilization among University of Virginia (UVA) Health System patients, examining which patient characteristics predict reduced ED usage and whether telemedicine reduces ED utilization. Materials and Methods: We used UVA Electronic Health Records and public datasets to establish clinical and contextual features including demographics, comorbidities, insurance status, and community characteristics. UVA patient data were lin...
Importance: High-quality discharge summaries are essential for safe care transitions but contribute substantially to clinician documentation burden and burnout. While retrospective studies suggest large language models (LLMs) can generate clinical summaries of comparable quality to physicians, prospective data on their safety, utility, and impact on clinician well-being in real-world environments are lacking. Objective: To evaluate the safety, utilization, and impact on clinician burden of MedAgent...
Background: Electronic health record (EHR)-based phenotyping underpins genome-wide association studies, yet current ICD-code phenotypes rely heavily on manually curated lists such as Phecodes. These definitions are labour-intensive to maintain, inherently subjective, and may omit clinically relevant diagnostic codes, reducing study power. Advances in text embedding models offer an opportunity to automate and standardize ICD-based phenotype construction. Methods: We developed Phecoder, an ensemble o...
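The core idea of embedding-based phenotype construction can be illustrated in a few lines: embed each ICD code description and the phenotype definition, then include codes whose embedding is close to the definition. The 3-D vectors, codes, and threshold below are toy stand-ins, not Phecoder's actual model or outputs:

```python
# Toy illustration of selecting ICD codes for a phenotype by cosine
# similarity between text embeddings. Vectors are hand-made stand-ins
# for real embedding-model outputs.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

phenotype = [1.0, 0.2, 0.0]           # e.g. "type 2 diabetes" definition
icd_embeddings = {
    "E11.9": [0.9, 0.3, 0.1],         # semantically close -> included
    "S52.5": [0.0, 0.1, 1.0],         # unrelated (fracture) -> excluded
}
threshold = 0.8
selected = [code for code, vec in icd_embeddings.items()
            if cosine(phenotype, vec) >= threshold]
```

A curated list like a Phecode is then replaced by whatever codes clear the similarity threshold, which is what makes the construction automatic and reproducible.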
Background: Large language models (LLMs) are increasingly piloted as chat interfaces for chart review and clinical decision support. Although leading models achieve and even exceed physician-level accuracy on exam-style benchmarks such as MedQA, recent perturbation studies show large drops in accuracy after small changes to prompts, distractor content, or answer format. Prior work has not systematically examined how these vulnerabilities unintentionally manifest in clinically realistic settings, i...
Cross-jurisdictional pharmaceutical compliance requires comparative analysis of regulatory requirements across jurisdictions such as the US FDA and China's NMPA. Although large language models (LLMs) are increasingly explored for healthcare-related applications, their performance in cross-jurisdictional regulatory comparison has not been systematically characterized using dedicated benchmarks. This study introduces Sino-US-DrugQA, a bilingual benchmark dataset designed to evaluate LLM performance...
Background and aims: Clinical LLM deployment is shifting from feasibility to liability, while current guidance largely treats model behavior as a control problem. We tested whether decision-style system prompts shift clinical action thresholds when clinical facts are held constant, and whether these shifts are consistent across settings and models. Methods: We defined nine physician personas by crossing three ethical orientations (duty-based, care-based, utilitarian) with three cognitive styles (intuitive, i...
Objective: Electronic Health Record (EHR)-based trial emulation can support translation of randomized clinical trial (RCT) evidence into practice, yet emulations often diverge from published RCT results. We hypothesized that these discrepancies are structured and learnable properties of a health system's data-generating process, and that autonomous agentic workflows can generate discrepancies at the scale required for cumulative learning. Materials and Methods: We developed an agentic trial emulatio...
Large Language Models (LLMs) demonstrate strong performance at medical specialty board multiple-choice question (MCQ) answering but underperform in more complex medical reasoning scenarios. This gap indicates a need to improve both LLM medical reasoning and evaluation paradigms. We introduce MedEvalArena, a framework in which LLMs engage in a symmetric round-robin format. Each model generates challenging board-style medical MCQs, then serves in an ensemble LLM-as-judge bench to adjudica...
Introduction: Clinical and population decision-making relies on the systematic evaluation of extensive regulatory evidence. FDA drug reviews provide detailed information on clinical trial design, enrollment criteria, sample size, randomization, comparators, endpoints, and indications. However, extracting these data is resource-intensive and time-consuming. Generative artificial intelligence large language models (LLMs) may accelerate the extraction and synthesis of such information. This study...
Objective: Much medical data is only available in unstructured electronic health records (EHRs). These data can be obtained through manual (human) extraction or programmatic natural language processing (NLP) methods. We estimate that NLP only becomes economically competitive with manual extraction when there are ~6,500 EHR records. We have found that there is interest from clinicians and researchers in using NLP on projects with fewer records. We examine whether a large language model (LLM) can be ...
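The economic crossover claim follows from simple break-even arithmetic: NLP carries a fixed development cost amortized over records, while manual extraction costs roughly a constant amount per record. A worked sketch, with dollar figures that are invented purely so the arithmetic lands near the abstract's ~6,500-record figure (the study's actual cost assumptions are not given here):

```python
# Break-even record count: total NLP cost (fixed + per-record) equals
# total manual cost (per-record only) at N* = fixed / (manual - nlp).
# All cost values below are hypothetical illustrations.
import math

def crossover_records(nlp_fixed: float, nlp_per_record: float,
                      manual_per_record: float) -> int:
    """Smallest record count at which NLP is no more expensive than manual."""
    return math.ceil(nlp_fixed / (manual_per_record - nlp_per_record))

# Hypothetical: $13,000 to build the NLP pipeline, $0.10/record to run it,
# $2.10/record for manual chart abstraction.
n_star = crossover_records(13_000, 0.10, 2.10)  # 6500
```

Below the crossover, the fixed development cost dominates, which is why the abstract turns to LLMs for the many projects with fewer records.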
Clinical trial statistical programming requires 12-24 FTE-months for a typical Phase 3 study, producing 100-500 tables, listings, and figures (TLFs) across 8-15 ADaM domains. Modern AI coding agents (Augment Code, Claude Code, Cline, Cursor) demonstrate remarkable reasoning capabilities but lack the domain-specific tools needed for clinical programming: they cannot read SAS datasets, parse ADaM specifications, analyze regulatory-relevant log issues, or generate CDISC-compliant code without exten...
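Giving a coding agent a domain-specific capability of the kind this abstract lists typically means registering a tool declaration the model can call. The sketch below uses the JSON-schema style common to current tool-calling APIs; the tool name, description, and parameters are hypothetical, not taken from any of the named products:

```python
# Hypothetical tool declaration a clinical-programming agent could
# register, in the JSON-schema style used by common tool-calling APIs.
# Name and fields are illustrative assumptions, not a shipped interface.

read_sas_tool = {
    "name": "read_sas_metadata",
    "description": "Return variable names, labels, and types for a "
                   "SAS dataset (e.g., an ADaM ADSL file).",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string",
                     "description": "Path to a .sas7bdat file"},
        },
        "required": ["path"],
    },
}
```

The point of such declarations is exactly the gap the abstract names: the agent's reasoning is general-purpose, but reading SAS datasets or parsing ADaM specifications requires tools exposed to it explicitly.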